From Deterministic to Generative: Multi-Modal Stochastic RNNs for Video Captioning

نویسندگان

Jingkuan Song

Yuyu Guo

Lianli Gao

Xuelong Li

Alan Hanjalic

Heng Tao Shen

چکیده

Video captioning in essential is a complex natural process, which is affected by various uncertainties stemming from video content, subjective judgment, etc. In this paper we build on the recent progress in using encoder-decoder framework for video captioning and address what we find to be a critical deficiency of the existing methods, that most of the decoders propagate deterministic hidden states. Such complex uncertainty cannot be modeled efficiently by the deterministic models. In this paper, we propose a generative approach, referred to as multi-modal stochastic RNNs networks (MS-RNN), which models the uncertainty observed in the data using latent stochastic variables. Therefore, MS-RNN can improve the performance of video captioning, and generate multiple sentences to describe a video considering different random factors. Specifically, a multimodal LSTM (M-LSTM) is first proposed to interact with both visual and textual features to capture a high-level representation. Then, a backward stochastic LSTM (S-LSTM) is proposed to support uncertainty propagation by introducing latent variables. Experimental results on the challenging datasets MSVD and MSR-VTT show that our proposed MS-RNN approach outperforms the state-of-the-art video captioning benchmarks.

برای دانلود رایگان متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

A Deep Generative Model for Disentangled Representations of Sequential Data

We present a VAE architecture for encoding and generating high dimensional sequential data, such as video or audio. Our deep generative model learns a latent representation of the data which is split into a static and dynamic part, allowing us to approximately disentangle latent time-dependent features (dynamics) from features which are preserved over time (content). This architecture gives us ...

متن کامل

Stochastic Video Prediction with Conditional Density Estimation

Frame-to-frame stochasticity is a major challenge in video prediction. The use of standard feedforward and recurrent networks for video prediction leads to averaging of future states, which can in part be attributed to the networks’ limited ability to model stochasticity. We propose the use of conditional variational autoencoders (CVAE) for video prediction. To model multi-modal densities in fr...

متن کامل

Using a new modified harmony search algorithm to solve multi-objective reactive power dispatch in deterministic and stochastic models

The optimal reactive power dispatch (ORPD) is a very important problem aspect of power system planning and is a highly nonlinear, non-convex optimization problem because consist of both continuous and discrete control variables. Since the power system has inherent uncertainty, hereby, this paper presents both of the deterministic and stochastic models for ORPD problem in multi objective and sin...

متن کامل

Automated Image Captioning Using Nearest-Neighbors Approach Driven by Top-Object Detections

The significant performance gains in deep learning coupled with the exponential growth of image and video data on the Internet have resulted in the recent emergence of automated image captioning systems. Two broad paradigms have emerged in automated image captioning, i.e., generative model-based approaches and retrieval-based approaches. Although generative model-based approaches that use the r...

متن کامل

Sequence to Sequence Model for Video Captioning

Automatically generating video captions with natural language remains a challenge for both the field of nature language processing and computer vision. Recurrent Neural Networks (RNNs), which models sequence dynamics, has proved to be effective in visual interpretation. Based on a recent sequence to sequence model for video captioning, which is designed to learn the temporal structure of the se...

متن کامل

ذخیره در منابع من

ذخیره در منابع من قبلا به منابع من ذحیره شده

{@ msg_add @}

با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

عنوان ژورنال:

CoRR

دوره abs/1708.02478 شماره

صفحات -

تاریخ انتشار 2017

From Deterministic to Generative: Multi-Modal Stochastic RNNs for Video Captioning

نویسندگان

چکیده

منابع مشابه

A Deep Generative Model for Disentangled Representations of Sequential Data

Stochastic Video Prediction with Conditional Density Estimation

Using a new modified harmony search algorithm to solve multi-objective reactive power dispatch in deterministic and stochastic models

Automated Image Captioning Using Nearest-Neighbors Approach Driven by Top-Object Detections

Sequence to Sequence Model for Video Captioning

عنوان ژورنال:

اشتراک گذاری